Skip to main content

B-Tree Basic

B-Trees are one of the most popular Storage Data Structure for Storage Engines and Databases in the wild. This is due to it's nature of addressing issues like search by using logarithmic search in huge datasets - which can reduce searches in 4bi entries by only 32 op's, minimizing the amount of Disk IO we need to fetch certain datas.

Structure

A B-Tree is sorted data-structure belonging the Tree family - variant Balanced Tree - that consists of a Root Node, Internal Nodes and Leaf Nodes.

Each Internal Node contain a N Keys (a.k.a index-entries,separator-key or dividir cells) and N+1 Pointers, being the +1 a point to the next Internal Node Leaf Nodes should always be at the same depth in B-Tree to prevent Tree getting unbalanced

The Tree is divided by Index Entries or Separator-Key or Divider Cells. These Keys splits the tree into smaller sub-trees - which holds a certain subrange of keys K1 < 1 < K3

B-Trees are organized into pages (2kb to 16kb - or a page from disk which may change from OS & Storage combo). The term Node is interchangeable with Page

Occupancy is the relation between the amount of keys a Node holds and it's capacity - which characterizes the B-Tree Nodes due to it's high fanout nature.

A B-Tree delivers two essential assumptions about the data-structure:

  • High Fanout: which just means it grows horizontally - ie adding more internal nodes and just splitting/merging by necessity
  • Low Height: We only have 3 levels root, internal and leaf.

Algorithms

B-Tree Insert Algorithm

  • Perform B-Tree lookup to find Leaf Node and locate an insert point
  • Append the key and the value to the leaf node insertion point.
    • Updates only change the value.
  • In case of overflowed leaf nodes, ie exceeded maximum capacity, we split the leaf node
    • Split int case leaf node N+1 is reached
    • for Internal Nodes (N+1)+1 is reached.
B-Tree Insert (Splits) Promoted

A insert is done by allocating a new node and transferring half of the keys and values into it. Then adding it's first key and point to the parent node. Being split point the index where we split the data, we transfer data from split point++ into the new node. If needed, we'll propagate the splits up above.

B-Tree Lookup Algorithm

  1. From root node
  2. Perform BST comparing the searched KEY within the keys stored in Internal Nodes
  3. Find the first separator-key
  4. Access the InternalNode.subtree from that `separator-key
    1. Repeat the process of looking for equal separator-key until you reach a leaf-node
    2. If there's no key equal to your searched key and the left-most is lower, there's no key in your B-Tree

B-Tree and Disk Drives:

HDD: The most expensive operation for a HDD is a disk seeks which consists of manually moving the mechanical head of the HDD to a certain position - to enable reading certain sectors. After moving the mechanical head, reading is very cheap. Typical range of bytes read in each sector - 512bytes to 4kb

SSD: SSDs are structure into small segmented structures hierarchiachly (??) by 32/64 memory cells -> 1 string -> arrays -> (2kb to 16kb)pages -> blocks -> planes -> N die A Block usually has 64 to 512 pages. The smallest unit for a SSD is a page. (2kb to 16kb). HOWEVER writing can only occur in empty blocks - ie we cannot update data. So we need to empty a block (delete it) - which can only occur by erase block. Differently from HDD who drivers need to control the mechanical head to seek for data, SSD keeps a controller (FTL) that maps pageId to their physical location - and controls the entire hardware. Responsabilities from FTL:

  • Garbage Collection
  • Mapping page ids to location in-disk
  • Tracking
    • empty pages
    • written pages
    • discarded pages

Differences between SSD and HDD

Nowdays OS's abstract interactions with the storage by block devices abstraction and let us access data in chunks. Even if ask a small bit of data, at least a block is read from the storage. The main difference resides in using HDD in case you workload is built of sequential IO which benefits from HDD cheap read operations after a seek.

Sequential IO can be describe as the phenomemon of doing any kind of IO (W/R) in sequentia locality - ie in the next offset - preventing any expensive Disk Seeks